Fix Batch Size Calculation for Multi-GPU Training #5

cosineai · 2024-11-15T19:02:23Z

This pull request fixes an inconsistent batch size calculation during multi-GPU training. Previously, the number of batches per epoch did not account for the number of GPUs, so the resulting batch size was incorrect. The fix adjusts the data parallel world size and rank when the model parallel unit (mpu) is not defined. With this change, the number of batches is calculated as the training data size divided by the number of GPUs, which is the expected behavior. The changes are in deepspeed/runtime/engine.py.
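The idea can be illustrated with a minimal, standalone sketch. It is not the code from this PR: `FakeMpu`, `resolve_data_parallel`, and `batches_per_rank` are hypothetical names used only for illustration, and the global world size and rank are passed in as plain integers rather than read from a distributed backend. The only assumption about the mpu is that it exposes `get_data_parallel_world_size()` and `get_data_parallel_rank()`. The batch count also assumes the data is split evenly across ranks and that no partial batch is dropped.

```python
import math


class FakeMpu:
    """Hypothetical stand-in for a model parallel unit (illustration only)."""

    def __init__(self, dp_world_size, dp_rank):
        self._dp_world_size = dp_world_size
        self._dp_rank = dp_rank

    def get_data_parallel_world_size(self):
        return self._dp_world_size

    def get_data_parallel_rank(self):
        return self._dp_rank


def resolve_data_parallel(mpu, global_world_size, global_rank):
    """Return (data parallel world size, data parallel rank).

    When an mpu is supplied, it decides the data parallel layout.
    When mpu is None, every process is treated as a data parallel
    replica, so the global world size and rank are used instead.
    """
    if mpu is not None:
        return mpu.get_data_parallel_world_size(), mpu.get_data_parallel_rank()
    return global_world_size, global_rank


def batches_per_rank(num_samples, micro_batch_size, dp_world_size):
    """Batches each rank sees per epoch, assuming an even split of the data."""
    samples_per_rank = math.ceil(num_samples / dp_world_size)
    return math.ceil(samples_per_rank / micro_batch_size)


# Without an mpu, 4 GPUs and 1000 samples give 250 batches per rank
# at a micro batch size of 1, instead of 1000.
world_size, rank = resolve_data_parallel(None, global_world_size=4, global_rank=2)
print(world_size, rank, batches_per_rank(1000, 1, world_size))
```

In a real run the fallback values would presumably come from the distributed backend (world size and rank of the current process) rather than from function arguments.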

Created by Genie. You can follow its reasoning on Cosine

Co-authored-by: Genie <genie@cosine.sh>

fix: handle data parallel world size and rank when mpu is None

09e6f7b

Co-authored-by: Genie <genie@cosine.sh>
